28 ◾ Bioinformatics
1.5.4 Per Sequence Quality Scores
The per sequence quality score graph plots the mean quality (Phred score) of each read on
the x-axis against the number of reads (frequency) on the y-axis. The graph shows whether
a subset of reads has an overall low quality. Ideally, the majority of the reads have a mean
quality score of 30 or higher, so the curve peaks toward the right end of the x-axis
(Figure 1.18a). A large number of reads with an overall low quality indicates a systematic
problem in the run. A warning is displayed if the most frequently observed mean quality
score is below 27 (Figure 1.18b), which corresponds to an error rate of about 0.2%. An error
is displayed if the most frequently observed mean quality score is below 20, which
corresponds to an error rate of 1%. Low-quality reads can be filtered out to keep only the
reads that pass a quality threshold.
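The calculation behind this graph can be sketched in a few lines of Python. This is a minimal illustration, not FastQC's own code: the function names (mean_phred, per_sequence_quality, classify, filter_reads) are hypothetical, quality strings are assumed to be non-empty and Phred+33 encoded, and truncating each mean to an integer before counting is an assumption of the sketch.

```python
from collections import Counter

def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)

def per_sequence_quality(quality_lines, offset=33):
    """Count reads per mean quality score (mean truncated to an integer)."""
    return Counter(int(mean_phred(q, offset)) for q in quality_lines)

def classify(histogram, warn_below=27, fail_below=20):
    """Flag the run by the most frequently observed mean quality."""
    mode_quality = max(histogram, key=histogram.get)
    if mode_quality < fail_below:
        return "fail"
    if mode_quality < warn_below:
        return "warn"
    return "pass"

def filter_reads(records, min_mean_quality=20, offset=33):
    """Keep (header, sequence, quality) records at or above the threshold."""
    return [r for r in records if mean_phred(r[2], offset) >= min_mean_quality]

# Toy data: 'I' encodes Phred 40, '5' encodes Phred 20, '#' encodes Phred 2.
qualities = ["IIIIIIII", "IIIIIIII", "55555555", "########"]
histogram = per_sequence_quality(qualities)
print(sorted(histogram.items()))  # [(2, 1), (20, 1), (40, 2)]
print(classify(histogram))        # pass
```

Plotting the histogram keys on the x-axis against their counts on the y-axis gives the curve described above. The threshold of 20 used by filter_reads is only an example value.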
1.5.5 Per Base Sequence Content
The per base sequence content graph depicts the percentage of each of the four bases (A, C,
G, and T) called at each position across all reads in a FASTQ file. The positions are plotted on
the x-axis against the base percentage on the y-axis. If there is no bias and library sequencing
is random, we expect little difference among the proportions of the four bases at each
position. The percentage of each base is expected to be close to 25%, and the four
lines will run approximately parallel to each other, as shown in Figure 1.19a. Any deviation
from this pattern raises suspicion of a bias or a systematic fault, and some sequences
may be overrepresented, as shown in Figure 1.19b. A higher percentage of some bases at the
beginning of the x-axis may indicate contaminating remnants of adaptor sequences or
other contaminating sequences. A warning is displayed if the difference between A and T, or
between G and C, is greater than 10% at any position, and a failure is reported if either
difference is greater than 20% at any position.
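The per-position percentages and the 10% and 20% checks can be illustrated with a short Python sketch. The function names (per_base_content, classify_content) are hypothetical. Comparing A with T and G with C follows the description in the FastQC documentation and is stated here as an assumption of the sketch; ignoring N calls when computing percentages is likewise an assumption.

```python
from collections import Counter

def per_base_content(reads):
    """Return one dict per position mapping A, C, G, T to its percentage."""
    length = max(len(r) for r in reads)
    content = []
    for pos in range(length):
        # Reads shorter than the current position do not contribute.
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts[b] for b in "ACGT")  # N calls are ignored
        content.append(
            {b: (100.0 * counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return content

def classify_content(content, warn_above=10.0, fail_above=20.0):
    """Flag by the largest A-T or G-C percentage gap at any position."""
    worst = max(
        max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"])) for p in content
    )
    if worst > fail_above:
        return "fail"
    if worst > warn_above:
        return "warn"
    return "pass"

# Toy data: every base appears once at every position, so each is 25%.
balanced = ["ACGT", "TGCA", "CATG", "GTAC"]
content = per_base_content(balanced)
print(content[0])                 # {'A': 25.0, 'C': 25.0, 'G': 25.0, 'T': 25.0}
print(classify_content(content))  # pass

# Toy data: every read starts with A, so A is 100% at the first position.
biased = ["AAGT", "ACGT", "AGCT", "ATCA"]
print(classify_content(per_base_content(biased)))  # fail
```

Plotting the four percentages at each position gives the four lines of the graph; in the balanced example they coincide at 25%.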
1.5.6 Per Sequence GC Content
The per sequence GC content graph plots the number of reads on the y-axis against the
mean GC percentage per read on the x-axis (Figure 1.20). It depicts the distribution of GC
FIGURE 1.18 Per sequence quality scores.